Adaptive Join Plan Generation in Hadoop For CPS296.1 Course Project
Abstract
Joins in Hadoop have always been a problem for its users: the Map/Reduce framework seems to be designed specifically for group-by aggregation tasks rather than cross-table operations. At the same time, joins in distributed database systems were never easy, because data location and skew make join strategies harder to optimize. Fragment-replicate join (map join) can be a clever step towards good performance in some cases, but it can be a dangerous move under certain circumstances. This paper introduces some new techniques used in map join to tackle these issues, and proposes a plan generator for the join types that we currently have.
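To make the trade-off in the abstract concrete, here is a minimal sketch of a fragment-replicate (map-side) join together with a naive size-based choice between join types. It is not the paper's plan generator. The function names `map_join` and `choose_join`, the tuple row layout, and the memory-budget rule are all assumptions made for illustration. The rule reflects one commonly cited risk of map joins: the replicated table has to fit in the memory of every map task.

```python
from collections import defaultdict


def map_join(small, large, key_small, key_large):
    """Fragment-replicate join sketch (hypothetical helper, not from the paper).

    `small` stands in for the table replicated to every mapper;
    `large` stands in for the fragment streamed through one mapper.
    Rows are tuples; `key_small` / `key_large` are column indices.
    """
    # Build an in-memory hash table on the replicated (small) side.
    table = defaultdict(list)
    for row in small:
        table[row[key_small]].append(row)

    # Stream the large side and probe the hash table; no shuffle is needed.
    for row in large:
        for match in table.get(row[key_large], ()):
            yield match + row


def choose_join(small_bytes, memory_budget_bytes):
    """Naive plan choice (assumed rule): use a map join only if the small
    table fits in the per-task memory budget, else fall back to a
    reduce-side join."""
    if small_bytes <= memory_budget_bytes:
        return "map_join"
    return "reduce_join"


if __name__ == "__main__":
    users = [(1, "a"), (2, "b")]
    clicks = [(1, "x"), (1, "y"), (3, "z")]
    print(list(map_join(users, clicks, 0, 0)))
    print(choose_join(small_bytes=10, memory_budget_bytes=100))
```

A real plan generator would also have to account for the skew and data-location effects the abstract mentions; the single threshold here only illustrates why a map join can fail when the replicated side is larger than expected.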
Similar resources
Adaptive Dynamic Data Placement Algorithm for Hadoop in Heterogeneous Environments
The Hadoop MapReduce framework is an important distributed processing model for large-scale, data-intensive applications. The current Hadoop and the existing Hadoop distributed file system’s rack-aware data placement strategy in MapReduce in the homogeneous Hadoop cluster assume that each node in a cluster has the same computing capacity and that the same workload is assigned to each node. Default Hadoop d...
Joins for Hybrid Warehouses: Exploiting Massive Parallelism in Hadoop and Enterprise Data Warehouses
HDFS has become an important data repository in the enterprise as the center for all business analytics, from SQL queries and machine learning to reporting. At the same time, enterprise data warehouses (EDWs) continue to support critical business analytics. This has created the need for a new generation of special federation between Hadoop-like big data platforms and EDWs, which we call the hybrid...
Dynamic Join Algorithm Switching at Query Execution Time
Join optimization is one of the most challenging tasks in query processing. The performance of joins depends not only on the algebraic/logical query execution plan (QEP), but also on the chosen join algorithms. Static optimization techniques often suffer from outdated or unavailable statistics on the data. This may result in sub-optimal QEPs and poor query execution times. Adaptive Query Pr...
Heads-Join: Efficient Earth Mover's Distance Similarity Joins on Hadoop
The Earth Mover’s Distance (EMD) similarity join has a number of important applications such as near-duplicate image retrieval and distributed based pattern analysis. However, the computational cost of EMD is super-cubic and consequently the EMD similarity join operation is prohibitive for datasets of even medium size. We propose to employ the Hadoop platform to speed up the operation. Simply p...
Distributed Adaptive Windowed Stream Join Processing
This paper presents an adaptive framework for processing a window-based multi-way join query over distributed data streams. The framework integrates distributed plan modification and distributed plan migration within the same scope by using a building block called the node operator set (NOS). An NOS is housed in each node that participates in the join execution, and specifies the set of atomic ...
Publication date: 2010